Remaining useful life prognosis of turbofan engines based on deep feature extraction and fusion

In turbofan engine datasets, to address problems, such as noise interference, diverse data types, large data volumes, complex feature extraction, inability to effectively describe degradation trends, and poor remaining useful life (RUL) prognosis effects, a remaining useful life prognosis model combining an improved stack sparse autoencoder (imSSAE) and an improved echo state network (imESN) is proposed in this paper. First, the 3-sigma criterion is adopted to remove the noise and reconstruct the data, and then the deep features of the engine are extracted by using an imSSAE and fused into health indicator (HI) curves describing the engine degradation trend. Finally, an attention mechanism is introduced into an imESN to adaptively process different types of data and obtain the RUL. The experimental results based on the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dataset show that compared with the other popular RUL prediction models, the combined model proposed in this paper has higher prediction accuracy, and the evaluation indices also show the effectiveness and superiority of the model.

www.nature.com/scientificreports/ deep feature processing. The following is the reasoning process of the imSSAE. First, the encoding and decoding process of the SAE can be expressed as follows: where S_encoder is the encoding activation function;S_decoder is the decoding activation function; H is the activation value of the hidden layer, X is the input sample data, and X is the output.b 1 , b 2 are the offsets of the encoding and decoding parts, respectively; and W 1 , W 2 are the weights of the encoding and decoding parts, respectively. The sparse auto encoder, like the auto encoder, applies back propagation and gradient descent algorithms to update and iteratively optimize the weight and offset in the encoder; then, the mean square error function is minimized. The function is expressed in Eq. (3), where h(·) is a hidden layer function, which approximates an identity function and can also make the input close to the output; The goal of the auto encoder is to make the learning function satisfy h(X) ≈ X , that is, the output X is approximately equal to the input X and where (x (i) , y (i) ) corresponds to the input and output, respectively, of the i − th sample.
In fact, the setting of the activation function will affect the result of the hidden layer, the output layer, and the mean square error function. The activation functions generally adopt the sigmoid function. When the input value of neurons is far from the zero point, the derivative value of sigmoid will become very small, almost zero. For the turbofan engine dataset, an excessively large number of samples will result in a slow convergence of the model and the problem of gradient disappearance. To solve the above problem, we replace the sigmoid function with the Tanh function in the encoding and decoding parts. the Tanh function converts the input value between −1 and 1 ; the derivative ranges from 0 to 1 , rather than the range of sigmoid from 0 to 1/4 ; and the mean of Tanh is 0 . To a certain extent, compared with sigmoid , Tanh delays the period of saturation. In addition, Tanh works better when features are distinctly different; therefore, the Tanh function is applied as the activation function of each sparse auto encoder used in this paper.
During network training, the distribution of input values will be biased as the depth increases and will easily result in gradient disappearance during back propagation. To avoid gradient disappearance, a batch normalization (BN) layer is added in front of each hidden layer of the SSAE so that the encoder forms an "input layer-BN layer-hidden layer" structure. After adding the BN layer, for each neuron of the hidden layer, the input distribution gradually approaches the limit saturation area of the value range after mapping to the nonlinear function and is forcibly pulled back to the standard normal distribution with a mean of 0 and a variance of 1, thus making the input of the nonlinear transformation function fall into the sensitive area, accelerating model learning and simplifying the network parameter adjustment. The processing formula of the BN layer is as follows: www.nature.com/scientificreports/ where x i l is the i−th input data of the l−th layer; i = 1, 2, . . . , n refers to the number of data passed from the previous layer to the next layer; µ is the mean of the input data; σ 2 is the variance of the input data; x i l is the normalized value of the data; ε is a small constant term; γ and β are the training parameters of the network; and BN is a new value through the linear transformation of γ and β after standardization.
In addition to the problem of gradient disappearance, the traditional SAE adds KL dispersion as a regular term to the loss function to constrain the sparseness of the entire network. However, the turbofan engine dataset studied in this paper is preprocessed, and the deep effective features of the dataset are extracted. Then, normalization limits the deep effective features and the subsequent HI to a range between [0, 1]. The KL dispersion as the sparse constraint of the SAE usually applies to the classification problem with a true value of 0 or 1. Therefore, the average activity ρ j cannot be used as a basis to punish the network. To solve this problem, this paper adds a dropout layer to realize the sparsity of the SAE. Similar to the BN layer, the dropout layer is introduced before the activation function during the encoding and decoding of the hidden layer. The dropout mechanism can randomly inactivate the neuron activation value during encoding and decoding (i.e., set a certain probability p to 0) to achieve the sparsity constraint, which can be written as: where I is the input of the original activation function, I ′ is the input of the activation function processed by the dropout layer, and Bernoulli( · ) represents the Bernoulli distribution. When the activation value is set to 0, the weight and offset of the corresponding neuron have not been updated in learning, and the neuron does not affect the original encoding and decoding process. The structure of the imSSAE is shown in Fig. 2. C. HI construction. Currently, most RUL prediction models use the objective function to mark the real RUL value, that is, by inputting the sensor data into the neural network to directly predict the RUL, thus leading to different RUL values for the same health level under the objective function. To solve this problem, this paper adopts engine features to construct an HI curve to describe the degradation trend of the engine. Specifically, after feature extraction by the imSSAE, a set of one-dimensional eigenvalues of each engine is obtained, and each eigenvalue represents the multisensor data information in the engine life cycle, which can also distinguish the fault and degradation state in the data. During the data training, according to the one-dimensional eigenvalues, the real-time state features, historical degradation features and engine failure features in the engine life cycle are obtained. The engine HI constructed in this paper is shown in the following.
where T is the length of the HI curve, F t is the real-time feature of the t − th life cycle, and F end is the failure feature of the engine. During construction, an HI value of 0 indicates that the engine has completely failed, and an HI value of 1 indicates that the engine is normal. The HI degradation state is updated and limited to [0, 1] by Eq. (8). HI and HI ′ are the HI values before and after the update, respectively; and HI max and HI min are the maximum and minimum HI values, respectively.
To quantitatively evaluate the effect of the HI curve, time relevance and monotonicity are proposed as evaluation indices in this paper. The time correlation of the HI curve of the i−th engine unit is shown in Eq. (9). where dQ i is the derivative of the sequence value in the i − th engine HI curve, Num of dQ i > 0 represents the number of dQ i values greater than 0, and Num of dQ i < 0 represents the number of dQ i values less than 0.
D. imESN. The traditional ESN network regards the elements of the input layer as an entirety of the same type, but in fact, the input of the ESN network in most cases consists of different elements representing different features. Taking the dataset of the turbofan engine studied in this paper as an example, the input of the neural network is the engine's HI curve value, which essentially represents the different features of the engine. Therefore, we improved the traditional ESN. The attention mechanism is added to the ESN network in this paper to adaptively process the various features of the engine we extracted and to ensure that the elements are completely input into the neural network and that the correct result is outputted. The attention mechanism is defined in Eq. (11): where the output d(t) of the attention mechanism is a vector whose dimension is consistent with the input layer state u(t) at time t ; ϕ(·) is the activation function, and the activation function in this paper is tanh; W in is the connection weight between the input layer and the reservoir; Ŵ is the state feedback weight of the reservoir; and b d is the offset of the attention mechanism. The previous output state x(t − 1) of the reservoir and the input layer state u(t) at time t are used to determine the importance of each feature of the input layer. After the attention mechanism is updated, the original input becomes Eq. (12).
where û(t) is the state of the new input and ⊙ refers to elemental multiplication. When the new input replaces the original input, the state of the imESN reservoir will also change and will be updated, as shown in Eq. (13) where x(t) is the state of the reservoir at time t , is the leakage rate with the range of [0, 1] , W back is the weight matrix of the input and output feedback, f (t − 1) is the state of the output layer at time (t − 1) , and η is the regularization coefficient. The structure of the imESN is shown in Fig. 3.
To improve the prediction accuracy of the imESN, the internal parameters of the imESN need to be optimized to make the model achieve the best status. The improved particle swarm optimization algorithm is adopted to optimize the parameters of the imESN, as shown in Algorithm 1.  to N Update the velocity i V and position i X of the particle i , and evaluate the particle i . 3. Update rules: (1) () rand is a random number between (0,1) ; is the inertia factor, its value is nonnegative, and it is related to the maximum number of iterations G ; the larger the value, the stronger the global optimization ability; and the smaller the value, the weaker the global optimization ability. 4. In each iteration, the particle updates itself by tracking two "extreme values" ( , ) i pBest gBest , calculates the loss through update, and then accumulates the loss as the fitness value. 5. Evaluate the fitness value of the particles and update the historical optimal position of each particle.
IF 2. The sample data is sequentially loaded to the input and output to initialize the reservoir state.
3. Update the input by using the attention mechanism, and obtain the new input state ˆ( ) u t . 4. Sampling: Set the initial state of the network as 0.
(1) The stateˆ( ) u t is added to the reservoir through the input connection weight matrix in W . (2) Calculate the system state ( ) x t , and output ( ) f t . 5. Calculate the output connection weight matrix out W , and make it meet the minimum mean square error. 6. Update the parameters by using a particle swarm optimization algorithm to obtain the best parameters. 7. Record the state of the reservoir, and use linear regression to determine the output connection weight matrix out W . Output: the imESN network with a fixed weight and a RUL corresponding to each engine unit.  Fig. 4. The C-MAPSS dataset is a widely used benchmark dataset in RUL prediction; the dataset includes 4 sets of monitoring data (FD001-FD004) under different running conditions of the engine. Each group of monitoring data contains a training set, a test set and an RUL set. Each engine in this dataset has different degrees of initial wear, which is unavailable to users, and much random noise is mixed in the dataset. It can be observed that the engine may fail, and the operating data of the engine stops over time, at this moment, the sensor captures the degradation. There are 21 sensors (such as the total temperature sensor of the fan inlet), 3 running parameters (i.e., flight altitude, Mach number, and throttle resolver angle), engine identification and cycles of each engine. The sensors can capture the engine performance degradation; this ability is the key to realizing RUL prediction. The specific description of the sensors is shown in Table 1.

Scientific Reports
FD001 has only one failure mode and one running condition; these characteristics are convenient for analyzing the degradation of the engine. To compare with the existing research, this paper selects FD001, the first set of monitoring data, to verify the method we proposed. In the training set, FD001 recorded the state monitoring data of 20,632 running cycle samples of 100 engines from normal operation to complete failure, and in the test set, the dataset recorded the state monitoring data of 13,097 running cycle states of 100 engines that were terminated at some point before the failure occurred. The purpose of this paper is to predict the number of remaining running cycles (RULs) of the engine before failure in the test set.  www.nature.com/scientificreports/ B. Data preprocessing. Because the original data contain considerable random noise, the FD001 dataset needs to be preprocessed before being input into the model; preprocessing ensures the validity of the data and reduces the experimental error. The FD001 data distribution is almost concentrated in the interval of (μ − 3σ, μ + 3σ), and the proportion is 99.73%. The rest of the data exceeding this interval is 0.27%. This part of the data belongs to the gross error and is considered the noise of the original data. In this paper, the 3-sigma criterion is adopted to reduce the noise of the original data. The 3-sigma criterion can remove the monitoring error and avoid the influence of data error on the prediction accuracy; compared with other methods, it has no impact on the original data and ensures the authenticity of the input data. Assuming that multiple measurement data have the same accuracy and distribution, the gross error in the measurement data can be eliminated by using the 3-sigma criterion. The 3-sigma criterion is expressed in Eq. (14).
where µ is the average value of the sample and σ represents the standard deviation. The monitoring of the turbine engine is completed by the cooperation of multiple sensors, and the monitoring data of the sensors directly reflect the degradation of the turbine engine. Therefore, we perform denoising on 21 sensor signals of all engine operating cycles in the dataset. The monitoring data on each running cycle in the dataset conform to a Gaussian distribution. As mentioned above, 3-sigma is used for denoising, and multiple data samplings are performed under the same conditions to eliminate the error and noise of each group of monitoring data. After denoising, the trend of sensor monitoring data in all engine life cycles is analyzed, and the degradation data are obtained. According to the experimental data analysis, the change trend of the sensor signal with the engine degradation can be divided into four types: monotonic rise, monotonic decline, irregular change and constant value. It is shown in Fig. 5.
Since some sensor signals are constant during degradation, they do not provide useful information for prediction. Therefore, we delete such data during the data processing stage, and to avoid affecting the reliability of the prediction results, we clean up and reorganize the FD001 training and test sets. Only 14 types of sensor signals are retained to form a new training set and a new test set. Table 2 summarizes the sensor trend information.
For the new FD001 dataset, to eliminate the dimensional influence between sensors and improve the convergence speed of the model, we use the normalization method to process the original monitoring data of the new dataset and limit the data size to the range of [0, 1].
In Eq. (15), X * is the value of each sensor after the normalization at each running cycle of the engine; X is the original sensor data before X * is processed; and X max and X min are the maximum and minimum, respectively, of each sensor in each engine running cycle.
C. Parameter settings. The performance of the model proposed in this paper is greatly affected by the parameters during training. To achieve the best performance, we need to select the best combination of parameters to improve the robustness of the model; the parameter settings mainly include those of the imSSAE and imESN. The fivefold cross-validation method is applied to train models with different parameters. The structure of the imSSAE is 14-8-4-1, which means that the input nodes are 14 (the 14-dimensional sensor data value of each running cycle), the nodes in the two hidden layers are 8 and 4, the output node is 1, and the output eigenvalue is adopted to construct the HI curve. In the imESN, we employ a particle swarm optimization algorithm to optimize the parameters and apply fivefold cross-validation for verification, and each cross-validation is run twice to alleviate the problem of overfitting. In this algorithm, the learning factors c 1 and c 2 are both 1.49, the inertia factor σ is 0.8, the population size is 200, and the maximum number of iterations G is 50. Additionally, this paper uses the RMSE value of the model as an evaluation index to adjust the parameters and improve the prediction effect. The two model parameter combination settings are shown in Table 3.

D. Experiments result analysis.
During training, we input the 14-dimensional sensor data of each running cycle into the encoder of the imSSAE for feature extraction so that each running cycle of the engine obtains one feature. Then, the HI value of each running cycle is constructed according to the features, and finally, the HI curve of each engine is obtained. Figure 6 depicts the HI curve of 100 engines. To ensure that the characteristics of the original HI curve remain unchanged, fitting and smoothing are performed. We randomly choose 10 from 100 engines for the experiment, and the curves of the HI training set and test set are shown in Figs. 7 and 8.
As depicted in the above figures, the HI curve can well describe the degradation process of the engine and has good monotonicity. To better measure the construction effect of the HI curve, the HI curve evaluation index is used to compare the commonly used method of multisensor degradation state modeling proposed in references 19,20 and the method proposed in this paper. The results are shown in Table 4. The values are the average index of 100 HI curves and 100 engines. Table 4 shows that the method proposed in this paper is superior to other methods in terms of time relevance and monotonicity mainly because our method performs noise reduction and reorganization of the dataset before data processing, thereby leading to better monotonicity and little interference. In contrast, PCA and ELM_AE have monotonic inconsistencies due to noise interference, thus resulting in monotonic reduction. The high time correlation indicates that the ability to capture potential degraded features is strong. In summary, the feature extraction and HI construction of the method proposed in this paper can better reflect the engine's degradation process. www.nature.com/scientificreports/ Referring to the current popular literature 21,22 , to analyze the RUL of each engine unit, a piecewise linear degradation model is proposed to determine the target RUL in this paper. The operation state of the engine can www.nature.com/scientificreports/ be considered healthy in the initial stage, and the degradation will be obvious after a period of runs or after the engine is used to a certain degree; that is to say, the normal working state of the engine is a constant value, and after a certain period, the RUL of the engine will linearly decrease. In this paper, the initial constant RUL value of the observation data in the degradation stage is 125, and the RUL prediction results of the four monitoring units (24,34,82, and 100) are randomly selected, as shown in Fig. 9. The figure shows that the early RUL prediction is close to a constant value, thus indicating that the engine is in a healthy state, and as the running cycle increases, Table 2. Sensor trends summary.

Trend Sensor number
Increasing [2,3,4,8,11,13,14,15,17 Decreasing [7,12,20,21] Irregular [9,14] unchanged [1,5,6,10,16,18,19]    www.nature.com/scientificreports/ the fault emerges. The prediction result basically shows a linear decline until failure occurs, and some engines (such as 82) have a poor initial prediction effect due to the insignificant initial degradation trend. As the cycle increases, the prediction error decreases. The prediction effect of the engine is best when it is close to failure, thus showing that the features at this time are the most prominent. Figure 10 shows the comparison results of the true RUL and the predicted RUL of each engine in the test set. To measure the prediction performance of the model more comprehensively, we select the latest and popular life prediction method to compare the evaluation indices of our method and other methods on the same dataset. These methods are evaluated based on RMSE and score, the calculation formula is show in Eq. (16) and Eq. (17). In order to enhance comparability, the same evaluation index is adopted in this paper; Experiments show that SAE extracts features and maps them from high dimension to low dimension. If the dimension gap is too large and the hidden features are inaccurate, the experimental results are not ideal, however, imSSAE can extract the features layer by layer in the form of stacking, improve the training efficiency, and the final hidden features can better express the original data. ESN without attention mechanism cannot adaptively process different features, and cannot process the data set in this paper. ESN with attention mechanism can adaptively process features, and optimize and update the internal parameters of ESN network, which can make the internal parameters of the network reach an optimal state and improve the prediction accuracy. In addition, the proposed method is also compared with others; the results are shown in Table 5.
Root mean square error (RMSE): this measurement standard is used to measure the deviation between the predicted value and the real value. It is a common evaluation index, which provides a consistent prediction weight for the front and rear periods of turbofan engine. The calculation formula is as follows: where n is the sample number of turbofan engine test data; y i refers to the true value of RUL of the i th turbofan engine, y i is the predicted value of RUL of the i th turbofan engine.
Scoring function (SF): RMSE has the same punishment for early prediction and later prediction in terms of prediction. Different from RMSE, the scoring function is more inclined to early prediction (RUL predicted value is less than the actual value) than later prediction. The scoring function applies larger punishment for overestimated value and smaller punishment for underestimated value. The scoring function is calculated as follows:  www.nature.com/scientificreports/ where SF represents the score value and n is the number of samples. The lower the value of SF, the better the effect of the model. Compared with other methods, the method proposed in this paper has the minimum of the two scoring indicators, namely, RMSE and score; specifically, the RMSE obtained by the proposed method is 39.5-75% lower than that obtained by traditional machine learning methods, such as the MLP, SVM, and RF; and the score obtained by the proposed method is 59% lower than that obtained by the RF. The method proposed in this paper obtains these results because the neural network structure of the method can adaptively extract different features  www.nature.com/scientificreports/ from engines and optimize the network parameters to improve the prediction effect of the model. Compared with the RMSE and score obtained by the hybrid neural network structure (such as the DBN, CNN-LSTM, and HDNN), the RMSE and the score obtained by the proposed method is reduced by 22.1-33.3% and 19.5-52.8%, respectively, because the encoder of the imSSAE can obtain low-dimensional and more effective features and the imESN can reasonably construct the HI curve to improve the prediction effect. Compared with the RMSE and the score obtained by the autoencoder-BLSTM, the RMSE and score obtained by the proposed method is reduced by 12.5% and 25.6%, respectively; and compared with the RMSE and the score obtained by the VAE-D2GAN with encoder structure, the RMSE and score obtained by the proposed method is reduced by 10.8% and 24.5%, respectively. These results show that the data preprocessing used in this paper before feature extraction is more advantageous in summary, the prediction accuracy of this method on the C-MAPSS dataset is improved, and this method can be used to predict the maintenance time of the turbofan engines without expert knowledge and physical knowledge.

Conclusions
This study is intended to present an efficient and accurate method for RUL prognosis of turbofan engine. In this work, a deep feature extraction based method is proposed to automatically extract features with little prior knowledge and construct HI curve, additionally, facing the phenomenon of different feature information existing in the input data, an attention mechanism was presented to improve the performance of the RUL prognosis, a turbofan engine data set FD001 was adopted to demonstrate the effectiveness of the proposed methodology. Compared with other methods, the proposed method has obvious advantages in prediction accuracy. However, there is still room for improvement in predicting the remaining useful of turbofan engine, and further research is still needed. In this study, it has been assumed that samples in the data set are balanced; imbalanced sample will affect the prediction effect, the current approach shall be modified to consider the imbalance in data set. In addition, the failure threshold in the experiment is given in the data set, how to adaptively calculate the failure threshold according to different data sets is the subject of a new research.